Document Decomposition of Bangla Printed Text

Authors

  • Md. Fahad Hasan
  • Tasmin Afroz
  • Sabir Ismail
  • Md. Saiful Islam
Abstract

Keywords: skew, auto rotation.

Today all kinds of information are being digitized, and with them the huge archive of documents of various kinds. Optical Character Recognition (OCR) is the method through which newspapers and other paper documents are converted into digital resources. However, this method works on text only, so processing a document that contains non-textual zones produces garbage text as output. To digitize documents properly, they must therefore be preprocessed carefully, and the most important preprocessing step is segmenting the document into regions according to their category. The OCR systems available for the Bangla language have no algorithm that can fully categorize a newspaper or book page. We therefore worked on decomposing a document into its parts, such as headlines, sub-headlines, columns, and images. If the input is skewed or rotated, it is also deskewed and de-rotated. To decompose a Bangla document, we first find the edges of the input image. We then find the horizontal and vertical area in which every pixel lies, and cut the input image according to these areas. For each resulting sub-image we compute its height-width ratio and line height, and categorize the sub-image according to these values. To deskew the image, we find the skew angle and correct the image according to this angle. To de-rotate the image, we use the line height, the matra line, and the pixel ratio of the matra line.
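
The cutting step described in the abstract can be illustrated with a small sketch. The abstract does not give the authors' exact procedure, so the code below swaps in a plain projection-profile cut as a stand-in: it splits a binary image into horizontal bands at blank rows, then splits each band into blocks at blank columns, and reports each block's height-width ratio. The function names (`ink_runs`, `split_regions`, `height_width_ratio`) and the list-of-lists image format are assumptions made for illustration, not part of the paper.

```python
from typing import List, Tuple

Image = List[List[int]]          # assumed format: 1 = ink pixel, 0 = background
Box = Tuple[int, int, int, int]  # (top, bottom, left, right), half-open ranges


def ink_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where the profile is non-zero."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins here
        elif not value and start is not None:
            runs.append((start, i))        # a blank entry closes the run
            start = None
    if start is not None:
        runs.append((start, len(profile)))  # run reaches the end of the profile
    return runs


def split_regions(image: Image) -> List[Box]:
    """Cut the image into horizontal bands, then each band into blocks."""
    boxes = []
    row_profile = [sum(row) for row in image]
    for top, bottom in ink_runs(row_profile):
        band = image[top:bottom]
        col_profile = [sum(col) for col in zip(*band)]
        for left, right in ink_runs(col_profile):
            boxes.append((top, bottom, left, right))
    return boxes


def height_width_ratio(box: Box) -> float:
    """Height divided by width of a box returned by split_regions."""
    top, bottom, left, right = box
    return (bottom - top) / (right - left)


# Toy page: one full-width line on top, two "columns" below it.
page = [
    [1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 1],
]

for box in split_regions(page):
    print(box, round(height_width_ratio(box), 3))
```

On a real scan, text lines contain blank gaps between words and characters, so a practical version would need a minimum gap width before cutting; that threshold, like any cut-off used to label a block as a headline, column, or image from its ratio and line height, is not stated in the abstract and is left out here.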

Similar resources

Wavelet Packet Based Texture Features for Automatic Script Identification

In a multi-script environment, archives of documents printed in different scripts are common in practice. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the script type of the document. In this paper, a novel texture-based approach is presented to identify the script type of the collection of documents printed in ten Indian scripts ...

A New Approach to Bangla Text Extraction and Recognition From Textual Image

This paper presents a new approach to segment and recognize Printed Bangla Text using Characteristic functions and Hamming network. The main difficulties in printed Bangla text recognition are the separation of lines, words and individual characters. In this paper, a new algorithm has been proposed to detect and separate text lines, words and characters from printed Bangla text. The algorithm u...

A Survey on Script Segmentation for Bangla OCR

Script segmentation is an important primary task for any Optical Character Recognition (OCR) software. It is especially important in off-line OCR for printed characters. Through script segmentation, a large image of a written document is fragmented into a number of small pieces, which are then used for pattern matching to determine the expected sequence of characters. In the impl...

Document Analysis And Classification Based On Passing Window

In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing stage by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...

Supervised learning Methods for Bangla Web Document Categorization

This paper explores the use of machine learning approaches, or more specifically, four supervised learning methods, namely Decision Tree (C4.5), K-Nearest Neighbour (KNN), Naïve Bayes (NB), and Support Vector Machine (SVM), for the categorization of Bangla web documents. This is the task of automatically sorting a set of documents into categories from a predefined set. Whereas a wide range of methods h...

Journal:
  • CoRR

Volume: abs/1701.08706

Publication date: 2016